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LEAVE PIMA INDIANS ALONE: BINARY REGRESSION AS A 
BENCHMARK FOR BAYESIAN COMPUTATION 

NICOLAS CHOPIN AND JAMES RIDGWAY 


Abstract. Whenever a new approach to perform Bayesian computation is 
introduced, a common practice is to showcase this approach on a binary re¬ 
gression model and datasets of moderate size. This paper discusses to which 
extent this practice is sound. It also reviews the current state of the art of 
Bayesian computation, using binary regression as a running example. Both 
sampling-based algorithms (importance sampling, MCMC and SMC) and fast 
approximations (Laplace and EP) are covered. Extensive numerical results are 
provided, some of which might go against conventional wisdom regarding the 
effectiveness of certain algorithms. Implications for other problems (variable 
selection) and other models are also discussed. 


1. Introduction 


The field of Bayesian computation seems hard to track these days, as it is blos¬ 
soming in many directions. MCMC (Markov chain Monte Carlo) remains the main 
approach, but it is no longer restricted to Gibbs sampling and Hastings-Metropolis, 
as it includes more advanced. Physics-inspired methods, such as HMC (Hybrid 


Monte Garlo, 

Neal 

2010 

1 and its variants ( 

Girolami and Galderhead, 2011||Shah-| 

baba et al. 

2011 

Hoffman and Gelman 2013 

|. On the other hand, there is also a 


e.g. Del Moral et al.[ 2QQ6[ ), nested sampling (Skilling, 2QQ6[ ), or the fast approx¬ 


imations that originated from machine learning, such as Variational Bayes (e.g. 
Chap. 10), and EP (Expectation Propagation, 


Bishop 2006 


Minka 20011. Even 


Laplace approximation has resurfaced in particular thanks to the INLA methodo¬ 
logy ( |Rue et ah] 2009| |. 


One thing however that all these approaches have in common is they are almost 
always illustrated by a binary regression example; see e.g. the aforementioned 
papers. In other words, binary regressions models, such as probit or logit, are a de 
facto benchmark for Bayesian computation. 

This remark leads to several questions. Are binary regression models a reason¬ 
able benchmark for Bayesian computation? Should they be used then to develop 
a ‘benchmark culture’ in Bayesian computation, like in e.g. optimisation? And 
practically, which of these methods actually ‘works best’ for approximating the 
posterior distribution of a binary regression model? 

The objective of this paper is to answer these questions. As the ironic title 
suggests, our findings shall lead to us be critical of certain current practices. Spe¬ 
cifically, most papers seem content with comparing some new algorithm with Gibbs 
sampling, on a few small datasets, such as the well-known Pima Indians diabetes 
dataset (8 covariates). But we shall see that, for such datasets, approaches that are 
even more basic than Gibbs sampling are actually hard to beat. In other words, 
datasets considered in the literature may be too toy-like to be used as a relevant 
benchmark. On the other hand, if ones considers larger datasets (with say 100 
covariates), then not so many approaches seem to remain competitive. 
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We would also like to discuss how Bayesian computation algorithms should be 
compared. One obvious criterion is the error versus CPU time trade-off; this im¬ 
plies discussing which posterior quantities one may be need to approximate. A 
related point is whether the considered method comes with a simple way to evalu¬ 
ate the numerical error. Other criteria of interest are: (a) how easy to implement 
is the considered method? (b) how generic is it? (does changing the prior or the 
link function require a complete rewrite of the source code?) (c) to which extent 
does it require manual tuning to obtain good performances? (d) is it amenable to 
parallelisation? Points (a) and (b) are rarely discussed in Statistics, but relate to 
the important fact that, the simpler the program, the easier it is to maintain, and 
to make it bug-free. Regarding point (c), we warn beforehand that, as a matter 
of principle, we shall refuse to manually tune an algorithm on a per dataset basis. 
Rather, we will discuss, for each approach, some (hopefully reasonable) general re¬ 
cipe for how to choose the tuning parameters. This has two motivations. First, 
human time is far more valuable that computer time: Cook ( 2014[ | mentions that 
one hour of CPU time is today three orders of magnitude less expensive than one 
hour of pay for a programmer (or similarly a scientist). Second, any method re¬ 
quiring too much manual tuning through trial and error may be practically of no 
use beyond a small number of experts. 

Finally, we also hope this paper may serve as an up to date review of the state 
of Bayesian computation. We believe this review to be timely for a number of reas¬ 
ons. First, as already mentioned, because Bayesian computation seems to develop 
currently in several different directions. Second, and this relates to criterion (d). 


the current interest in parallel computation (Lee et al. 2010 Suchard et al., 20101 


may require a re-assessment of Bayesian computational methods: method A may 
perform better than method B on a single core architecture, while performing much 
worse on a parallel architecture. Finally, although the phrase ‘big data’ seems to 
be a tired trope already, it is certainly true that datasets are getting bigger and 
bigger, which in return means that statistical methods needs to be evaluated on 
bigger and bigger datasets. To be fair, we will not really consider in this work the 
kind of huge datasets that pertain to ‘big data’, but we will at least strive to move 
away from the kind of ‘ridiculously small’ data encountered too often in Bayesian 
computation papers. 

The paper is structured as follows. Section covers certain useful preliminaries 
on binary regression models. Section discusses fast approximations, that is, de¬ 
terministic algorithms that offer an approximation of the posterior, at a lower cost 
than sampling-based methods. Section discusses ‘exact’, sampling-based meth¬ 
ods. Section is the most important part of the paper, as it contains an extensive 
numerical comparison of all these methods. Section discusses variable selection. 
Section discusses our findings, and their implications for both end users and 
Bayesian computation experts. 


2. Preliminaries: binary regression models 


2.1. Likelihood, prior. The likelihood of a binary regression model have the gen¬ 
eric expression 

riT) 

( 2 . 1 ) p{V\f3) = ^F{y,f3^x,) 

i=l 


where the data T> consist of n responses yi G {—1,1} and n vectors Xi of p covariates, 
and F is some CDF (cumulative distribution function) that transforms the linear 
form yi0^Xi into a probability. Taking F — the standard normal CDF, gives 
the probit model, while taking F = L, the logistic CDF, L{x) = 1/ (1 -|- e~^), leads 
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to the logistic model. Other choices could be considered, such as e.g. the CDF of 
a Student distribution (robit model) to better accommodate outliers. 

We follow Gelman et al. (20081’s recommendation to standardise the predictors 
in a preliminary step: non-binary predictors have mean 0 and standard deviation 
0.5, binary predictors have mean 0 and range 1, and the intercept (if present) is 
set to 1. This standardisation facilitates prior specification: one then may set up a 
“weakly informative” prior for /3, that is a proper prior that assigns a low probability 
that the marginal effect of one predictor is outside a reasonable range. Specifically, 
we shall consider two priors p{f3) in this work: (a) the default prior recommended by 


Gelman et al. (20081, a product of independent Gauchys with centre 0 and scale 10 


for the constant predictor, 2.5 for all the other predictors (henceforth, the Gauchy 
prior); and (b) a product of independent Gaussians with mean 0 and standard 
deviation equal to twice the scale of the Gauchy prior (henceforth the Gaussian 
prior). 


Of course, other priors could be considered, such as e.g. Jeffreys’ prior (Firth 
l9^ , or a Laplace prior (|Kaban[ |2007[| . Our main point in considering the two 


priors above is to determine to which extent certain Bayesian computation methods 
may be prior-dependent, either in their implementation (e.g. Gibbs sampling) or 
in their performance, or both. In particular, one may expect the Gauchy prior to 
be more difficult to deal with, given its heavy tails. 


2.2. Posterior maximisation (Gaussian prior). We explain in this section how 
to quickly compute the mode, and the Hessian at the mode, of the posterior: 


^ p{p)p{p\j3)d(3, 


where p(/9) is one of the two priors presented in the previous section, and Z{T>) is 
the marginal likelihood of the data (also known as the evidence). These quantities 
will prove useful later, in particular to tune certain of the considered methods. 
The two first derivatives of the log-posterior density may be computed as: 


d(3d(3'^ 


\ogp{(3\V) 

\ogp{^\V) 


d d 

g^logA« + 


\ogp{V\(3), 

akw 


where 


A 


n-D 

\ogp{V\l3) = ^ (logF)' {yi0^Xi)y^x^ 


92 

d(3d(3T 


n-D 

\ogp{V\l3) = A (log J^)" x,)xixj 

i=l 


and (logF)' and (logP)" are the two first derivatives of log F. Provided that logF 
is concave, which is the case for probit and logit regressions, the Hessian of the 
log-likelihood is clearly a negative definite matrix. Moreover, if we consider the 
Gaussian prior, then the Hessian is of the log-posterior is also negative (as the sum 
of two negative matrices, as Gaussian densities are log-concave). We stick to the 
Gaussian prior for now. 

This suggests the following standard approach to compute the MAP (maximum 
a posterior) estimator, that is the point ,9map that maximises the posterior density 
p{(3\'D): to use Newton-Raphson, that is, to iterate 
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(2.2) /3(new) = /3(old) - H ^ logp(/3(old)|2?)| 

until convergence is reached; here H is Hessian of the log posterior at 13 — /9(oid)) 
as computed above. The iteration above corresponds to finding the zero of a local, 
quadratic approximation of the log-posterior. Newton-Raphson typically works 
very well (converges in a small number of iterations) when the function to maximise 
is concave. A variant of this approach is 

We note two points in passing. First, one may obtain the MLE (maximum 
likelihood estimator) by simply taking p{f3) = 1 above (i.e. a Gaussian with in¬ 
finite variance). But the MLE is not properly defined when complete separation 
occurs, that is, there exists a hyperplane that separates perfectly the two outcomes: 
yi0Q^Xi > 0 for some ,3cs and all z G 1 : iV. This remark gives an extra incentive 
for performing Bayesian inference, or at least MAP estimation, in cases where com¬ 
plete separation may occur, in particular when the number of covariates is large 


Firth 

1993 

Gelman et al. 

2008 


Variants of Newton-Raphson may be obtained by adapting automatically the 
step size (e.g. update is /3(new) = /3(oid) - | ^ logp(/3(oid) l^?)}, and step size 

A is determined by line search) or replacing the Hessian H by some approximation. 
Some of these algorithms such as IRLS (iterated reweighted least squares) have 
a nice statistical interpretation. For our purposes however, these variants seem 
to show roughly similar performance, so we will stick to the standard version of 
N ewton-Raphson. 


2.3. Posterior maximisation (Cauchy prior). The log-density of the Cauchy 
prior is not concave: 


logp(^) = - ^ log (ttctj) - ^ log(l -k /3 |/ct|) 
t=i i=i 

Hence, the corresponding log- 


for scales cr, chosen as explained in Section 2.1 


posterior is no longer guaranteed to be concave, which in turn means that Newton- 
Raphson might fail to converge. 

However, we shall observe that, for most of the datasets considered in this paper, 
Newton-Raphson does converge quickly even for our Cauchy prior. In each case, we 
used as starting point for the Newton-Raphson iterations the OLS (ordinary least 
square) estimate. We suspect what happens is that, for most standard datasets, 
the posterior derived from a Cauchy prior remains log-concave, at least in a region 
that encloses the MAP estimator and our starting point. 


3. Fast approximation methods 


This section discusses fast approximation methods, that is methods that are 
deterministic, fast (compared to sampling-based methods), but which comes with 
an approximation error which is difficult to assess. These methods include the 
Laplace approximation, which was popular in Statistics before the advent of MCMC 
methods, but also recent Machine Learning methods, such as EP (Expectation 
Propagation, Minka 20011, and VB (Variational Bayes, e.g. Bishop[ 2006[ Ch ap. 
10). We will focus on Laplace and EP; for VB, see Consonni and Marin[ ( MT7| for 
a discussion of why VB (or at least a certain standard version of VB, known as 
mean field VB) may not work so well for probit models. 

Concretely, we will focus on the approximation of the following posterior quant¬ 
ities: the marginal likelihood p{T>), as this may be used in model choice; and the 
marginal distributions p(/3i|I?) for each component (3i of (3. Clearly these are the 
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most commonly used summaries of the posterior distribution, and other quantities, 
such as the posterior expectation of 13, may be directly deduced from them. 

Finally, one should bear in mind that such fast approximations may be used as 
a preliminary step to calibrate an exact, more expensive method, such as those 
described in Section ID 

3.1. Laplace approximation. The Laplace approximation is based on a Taylor 
expansion of the posterior log-density around the mode /3map : 


logp(/3|X>) « logp(/3MAp|2?) - ^ (/3 - /3map)^ Q (/3 - /3map) , 
where Q = -H, i.e. minus the Hessian of \ogp{P\'D) at /3 = /3map; recall that 


we explained how to compute these quantities in Section 2.2 One may deduce 


a Gaussian approximation of the posterior by simply exponentiating the equation 
above, and normalising: 

(3.1) gL(/3)=iVp(^;/3MAP,Q'') 

:= (271)“^’/^ IQI^/^exp {(3 - /3map)^Q - /3map)| ■ 

In addition, since for any (3, 

(j.-, _ P(/3)p(P|^) 

^ Pim 

one obtains an approximation to the marginal likelihood p{'D) as follows: 

p{V) « Z,{V) := p(^map)p(P|/3map) ^ 

(27r)-P/2|Q|i/2 

From now on, we will refer to this particular Gaussian approximation as the 
Laplace approximation, even if this phrase is sometimes used in Statistics for higher- 
order approximations, as discussed in the next Section. We defer to Sectionthe 
discussion of the advantages and drawbacks of this approximation scheme. 

3.2. Improved Laplace, connection with INLA. Gonsider the marginal distri¬ 
butions p{j3j\'D) = f p(/3l'D)d/3-j for each component /3j of (3, where /3_j is f3 minus 
Pj. A first approximation may be obtained by simply computing the marginals of 
the Laplace approximation q^. An improved (but more expensive) approximation 
may be obtained from: 

(R ITD ^ P(/3)p(^l/3) 

^ p{(3-,\P,,V) 

which suggests to choose a fine grid of /3j values (deduced for instance from gi(/3)), 
and for each f3j value, compute a Laplace approximation of p{/3-j\Pj,'D), by com¬ 
puting the mode and the Hessian H{Pj) of logp{f3-j\f3j,'D), and then 

approximate (up to a constant) 


pi/3j\V) « qiUPj) oc 


p(^(/3,))p(I?TO)) 




1/2 


where f3{/3j) is the vector obtained by inserting j3i at position i in and 

IL stands for “Improved Laplace”. One may also deduce posterior expectations of 
functions of /3j in this way. See also Tierney and Kadane (19861, Tierney et al. 


(19891 for higher order approximations for posterior expectations. 

We note in passing the connection to the INLA scheme of Rue et al. ( 2009[ ). INLA 
applies to posteriors p{9,x\'D) where x is a latent variable such that p{x\6,'D) is 
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close to a Gaussian, and 0 is a low-dimensional hyper-parameter. It constructs 
a grid of 0— values, and for each grid point Oj, it computes an improve Laplace 
approximation of the marginals of p{x\6j,'D). In our context, (3 may be identified 
to X, 0 to an empty set, and INLA reduces to the improved Laplace approximation 
described above. 


3.3. The EM algorithm of Gelman et al. (2008) (Cauchy prior). Gelmanj 


et al. (20081 recommend against the Laplace approximation for a Student prior (of 


which our Gauchy prior is a special case), because, as explained in Section [2T3| the 
corresponding log-posterior is not guaranteed to be concave, and this might prevent 
Newton-Raphson to converge. In our simulations however, we found the Laplace 
approximation to work reasonably well for a Gauchy prior. We now briefly describe 
the alternative approximation scheme proposed by Gelman et al. (20081 for Student 
priors, which we call for convenience Laplace-EM. 

Laplace-EM is based on the well-known representation of a Student distribu¬ 
tion, ^ Ni(0,cr|), ^ Inv — Gamma(i^/2, Sji//2); take = 1 to recover 

our Gauchy prior. Gonditional on cr^ = (af,... ,(Jp), the prior on (3 is Gaussian, 
hence, for a fixed cr^ one may implement Newton-Raphson to maximise the log- 
density of p{l3\cr'^, and deduce a Laplace (Gaussian) approximation of the same 
distribution. 


Laplace-EM is an approximate EM (Expectation Maximisation, Dempster et al. 


1977) algorithm, which aims at maximising in cr^ = (crj,..., Up) the marginal pos¬ 


terior distribution p{cr^\'D) = Jp(cr^,/3|I?) d/3. Each iteration involves an expect¬ 
ation with respect to the intractable conditional distribution p(/3|(T^, I?), which 
is Laplace approximated, using a single Newton-Raphson iteration. When this 
approximate EM algorithm has converged to some value cr^, one more Newton- 
Raphson iteration is performed to compute a final Laplace approximation of p(/3|cr^, D), 
which is then reported as a Gaussian approximation to the posterior. We refer the 
readers to Gelman et al. (2008) for more details on Laplace-EM. 


3.4. Expectation-Propagation. Like Laplace, Expectation Propagation (EP, Minka 


20011 generates a Gaussian approximation of the posterior, but it is based on dif¬ 
ferent ideas. The consensus in machine learning seems to be that EP provides a 
better approximation than Laplace (e.g. Nickisch and Rasmussen, 2008); the in¬ 
tuition being that Laplace is ‘too local’ (i.e. it fitted so at to match closely the 
posterior around the mode), while EP is able to provide a global approximation to 
the posterior. 

Starting from the decomposition of the posterior as product of (nx> + 1) factors: 


. n-D 

TT k{(3), li{(3) = F(y,0^Xi) for i > 1, 

and Iq is the prior, Iq{(3) = p(/3), EP computes iteratively a parametric approxim¬ 
ation of the posterior with the same structure 

n-D . 

(3.2) 9Ep(/3) = nw9*(/3)- 

i=0 

Taking qi to be an unnormalised Gaussian densities written in natural exponential 
form 

gi(/3) = exp |-^/3^Q,/3 -k/3^n|, 
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one obtains for qep a Gaussian with natural parameters Q — Qi “ 

Sr=o’’b note that the more standard parametrisation of Gaussians may be re¬ 
covered by taking 

S = Q~^, n = Q^^r. 

Other exponential families could be considered for q and the qi’s, see e.g. |Seeger| 
(20051, but Gaussian approximations seems the most natural choice here. 


An EP iteration consists in updating one factor qi, or equivalently {Zi,Qi,ri), 
while keeping the other factors as fixed, by moment matching between the hybrid 
distribution 

h{f3) ^ k{(3)l[q,{(3) 

and the global approximation q defined in (|3.2|): compute 


Zh= 






Sh = 




^/3^Z,(/3)n9j (/3) d/3 

3¥=i 


and set 

Qi = — Q_i 




Hh -r_„ log Zi = logZh - '^ir,Q) + 


where r_i = J2j^i Q-i = Qj^ ^^^d ip{r, Q) is the normalising constant of 
a Gaussian distribution with natural parameters (r, Q), 


tpir, Q)= exp 
Jrp 


1 ^ 

-2/3^Q/3- 




d/3 = log |Q/27r| -t ^r'^Qr. 


In practice, EP proceeds by looping over sites, updating each one in turn until 
convergence is achieved. 

To implement EP for binary regression models, two points must be addressed. 
First, how to compute the hybrid moments? For the probit model, these moments 
may be computed exactly, see the supplement, while for the other links function 
(such as logistic), numerical (one-dimensional) quadrature may be used. Second, 
how to deal with the prior? If the prior is Gaussian, one may simply set go fo the 
prior, and never update go in the course of the algorithm. For a Gauchy prior, go 
is simply treated as an extra site. 

EP being a fairly recent method, it is currently lacking in terms of supporting 
theory, both in terms of algorithmic convergence (does it converge in a finite num¬ 
ber of iterations?), and statistical convergence (does the resulting approximation 
converges in some sense to the true posterior distribution as u-p —t -l-oo?). On the 
other hand, there is mounting evidence that EP works very well in many problems; 
again see e.g. 


Nickisch and Rasmussen (e.g. 20081. 


3.5. Discussion of the different approximation schemes. Laplace and its 
variants have complexity 0{nv + p^), while EP has complexity 0{nxip^)- Incid¬ 
entally, one sees that the number of covariates p is more critical than the number 
of instances np in determining how ‘big’ (how time-intensive to process) is a given 
dataset. This will be a recurring point in this paper. 

The p^ term in both complexities is due to the pxp matrix operations performed 


by both algorithms; e.g. the Newton-Raphson update (2.21 requires solving a linear 
system of order p. EP requires to perform such p^ operations at each site (i.e. for 
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each single observation), hence the 0{nxip^) complexity, while Laplace perform such 
operations only once per iteration. EP is therefore expected to be more expensive 
than Laplace. 

This remark may be mitigated as follows. First, one may modify EP so as 
to update the global approximation only at the end of each iteration (complete 
pass over the data). The resulting algorithm (van Gerven et al., 2010[ | may be 
easily implemented on parallel hardware: simply distribute the nx> factors over 
the processors. Even without parallelisation, parallel EP requires only one single 
matrix inversion per iteration. 

Second, the ‘improved Laplace’ approximation for the marginals described in 
Section |3.1| requires to perform quite a few basic Laplace approximations, so its 
speed advantage compared to standard EP essentially vanishes. 

Points that remain in favour of Laplace is that it is simpler to implement than 
EP, and the resulting code is very generic: adapting to either a different prior, or a 
different link function (choice of E in 2.11, is simply a matter of writing a function 
that evaluates the corresponding function. We have seen that such an adaptation 
requires more work in EP, although to be fair the general structure of the algorithm 
is not model-dependent. On the other hand, we shall see that EP is often more 
accurate, and works in more examples, than Laplace; this is especially the case for 
the Cauchy prior. 


4. Exact methods 

We now turn to sampling-based methods, which are ‘exact’, at least in the limit: 
one may make the approximation error as small as desired, by running the corres¬ 
ponding algorithm for long enough. We will see that all of these algorithms requires 
some form of calibration that requires prior knowledge on the shape of the posterior 
distribution. Since the approximation methods covered in the previous section are 
faster by orders of magnitude than sampling-based methods, we will assume that a 
Gaussian approximation q{P) (say, obtained by Laplace or EP) has been computed 
in a preliminary step. 

4.1. Our gold standard: Importance sampling. Let q{f3) denote a generic 
approximation of the posterior p{f3\T>). Importance sampling (IS) is based on the 
trivial identity 


piV) = I pi/3)piV\f3)df3 = I g(/3)^M|Md/3 

which leads to the following recipe: sample /3i,... ^ (3^ ~ q, then compute as an 
estimator of p{'D) 


1 . 

( 4 . 1 ) Zn = 

n—1 

In addition, since 


w{(3) := 


p(/3)p(P|/3) 

9 (/ 3 ) 


p{f3)pif3\V) d/3 = 


/ </5(/3)g(/3)w(/3) d/3 


J q{f3)w{f3) d/3 
one may approximate any posterior moment as 


J2n=lWi/3nMf3n) 


(4.2) 


Pn = 




Approximating posterior marginals is also straightforward; one may for instance 


use kernel density estimation on the weighted sample {I3n,w{l3n))n=i- 
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Concerning the choice of q, we will restrict ourselves to the Gaussian approxima¬ 
tions generated either from Laplace or EP algorithm. It is sometimes recommended 
to use a Student distribution instead, as a way to ensure that the variance of the 
above estimators is finite, but we did not observe any benefit for doing so in our 
simulations. 

It is of course a bit provocative to call IS our gold standard, as it is sometimes 
perceived as an obsolete method. We would like to stress out however that IS is 
hard to beat relative to most of the criteria laid out in the introduction: 


• because it is based on IID sampling, assessing the Monte Carlo error of 
the above estimators is trivial: e.g. the variance oi Zn may be estimated 
as N~^ times the empirical variance of the weights w{f3n)- The auto- 
normalised estimator |4.2| has asymptotic variance 


E„ 




q^{(3)p{f3\V) df3 


which is also trivial to approximate from the simulated (3nS. 

• Other advantages brought by IID sampling are: (a) importance sampling 
is easy to parallelize; and (b) importance sampling is amenable to QMC 
(Quasi-Monte Carlo) integration, as explained in the following section. 

• Importance sampling offers an approximation of the marginal likelihood 
p{'D) at no extra cost. 

• Code is simple and generic. 


Of course, what remains to determine is whether importance sampling does well 
relative to our main criterion, i.e. error versus CPU trade-off. We do know that IS 
suffers from a curse of dimensionality: take both q and and the target density tt to 
be the density of IID distributions: q{l3) = 11^=1 qiiPj), = 11^=1 7ri(/?j); then 
it is easy to see that the variance of the weights grows exponentially with p. Thus 
we expect IS to collapse when p is too large; meaning that a large proportion of the 
(5n gets a negligible weight. On the other hand, for small to moderate dimensions, 
we will observe surprising good results; see Section]^ We will also present below 
a SMC algorithm that automatically reduces to IS when IS performs well, while 
doing something more elaborate in more difficult scenarios. 

The standard way to assess the weight degeneracy is to compute the effective 

T9Mt , 


sample size (Kong et al. 


ESS = 




(/3n)}^ 




e [i,N], 


which roughly approximates how many simulations from the target distribution 
would be required to produce the same level of error. In our simulations, we will 
compute instead the efficiency factor EF, which is simply the ratio EF = ESS/iV. 


4.2. Improving importance sampling by Quasi-Monte Carlo. Quasi-Monte 
Carlo may be seen as an elaborate variance reduction technique: starting from the 
Monte Carlo estimators and see (4.11 and ( |4.2[ ), one may re-express the 
simulated vectors as functions of uniform variates in [0, l]'^; for instance: 


fdn — M “f ^Cn: Cn — ^ id^n) 

where is the fV(0,1) inverse CDF, applied component-wise. Then, one 
replaces the N vectors by a low-discrepancy sequence; that is a sequence of N 
vectors that spread more evenly over [0, l]'^; e.g. a Halton or a Sobol’ sequence. 
Under appropriate conditions, QMC error converges at rate for any 

e > 0, to be compared with the standard Monte Carlo rate We refer 
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to Lemieux (20091 for more background on QMC, as well as how to construct QMC 


sequences. 

Oddly enough, the possibility to use QMC in conjunction with importance 
sampling is very rarely mentioned in the literature; see however |Hormann and| 
Leydold (20051. More generally, QMC seems often overlooked in Statistics. We 


shall see however that this simple IS-QMC strategy often performs very well. 

One drawback of IS-QMC is that we lose the ability to evaluate the approx¬ 
imation error in a simple manner. A partial remedy is to use randomised Quasi- 
Monte Carlo (RQMC), that is, the are generated in such a way that (a) with 
probability one, Ui-j^ is a QMC point set; and (b) each vector is marginally 
sampled from [0,1]'^. Then QMC estimators that are empirical averages, such as 
Zfq = N~^ '^(Pn) become unbiased estimators, and their error may be as¬ 

sessed through the empirical variance over repeated runs. Technically, estimators 
that are ratios of QMC averages, such as (pj^, are not unbiased, but for all practical 
purposes their bias is small enough that assessing error through empirical variances 
over repeated runs remains a reasonable approach. 


4.3. MCMC. The general principle of MCMC (Markov chain Monte Carlo) is to 
simulate a Markov chain that leaves invariant the posterior distribution p(/3|I?); 
see Robert and Casella (20041 for a general overview. Often mentioned drawbacks 


of MCMC simulation are (a) the difficulty to parallelize such algorithms (although 


see e.g. Jacob et al., 2011 for an attempt at this problem); (b) the need to specify a 
good starting point for the chain (or alternatively to determine the burn-in period, 
that is, the length of the initial part of the chain that should be discarded) and 
(c) the difficulty to assess the convergence of the chain (that is, to determine if 
the distribution of 3* at iteration t is sufficiently close to the invariant distribution 

Pipm- 

To be fair, these problems are not so critical for binary regression models. Re¬ 
garding (b), one may simply start the chain from the posterior mode, or from a draw 
of one of the Gaussian approximations covered in the previous section. Regarding 
(c) for most standard datasets, MCMC converges reasonably fast, and convergence 
is easy to assess visually. The main issue in practice is that MCMC generates cor¬ 
related random variables, and these correlations inflate the Monte Carlo variance. 


4.3.1. Gibbs sampling. Consider the following data-augmentation formulation of 
binary regression: 

Zj = 0’^Xi + a 

Vi = sgn(zi) 

where z = (zi,..., z„^)^ is a vector of latent variables, and assume for a start 
that €i ~ A^(0,1) (probit regression). One recognises p{(3\z.,'D) as the posterior 
of a linear regression model, which is tractable (for an appropriate prior). This 
suggests to sample from p{f3,z\'D) using Gibbs sampling ( Albert and Chib} [1993 1: 
i.e. iterate the two following steps: (a) sample from z|/3,I?; and (b) sample from 
l3\z,V. 

For (a), the z^’s are conditionally independent, and follows a truncated Gaussian 
distribution 

p{zi\(3,'D) oc Ni 1) 1 {ziyi > 0} 

which is easy to sample from ( Chopin[ |2011[ |. For Step (b) and a Gaussian prior 
Np(0, Sprior), one has, thanks to standard conjugacy properties: 


/3|z;,X> ~ Np (pLpostiz), Spost) : 


y-l _ 

■^post •^prior 


XX , /Jpost(-2) — ^post®-^ 
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Algorithm 1 Hastings-Metropolis iteration 

Input: (3 
Output: [3' 

1: Sample [3* ~ k(/3*|/3). 

2: With probability 1 A r, 

p(/3-)p(I?|/3-)ac(/3|/3-) 
p(/3)p(I?|/3)«(/3*|/3) 
set /3' = f3*-, otherwise set f3' = f3. 


where x is the n x p matrix obtained by stacking the xf. Note that Spost and its 
inverse need to be computed only once, hence the complexity of a Gibbs iteration 
is O(p^), not 0{p^). 

The main drawback of Gibbs sampling is that it is particularly not generic: its 
implementation depends very strongly on the prior and the model. Sticking to the 
probit case, switching to another prior requires deriving a new way to update I3\z, T). 
For instance, for a prior which is a product of Students with scales aj (e.g. our 
Gauchy prior), one may add extra latent variables, by resorting to the well-known 
representation: Pj\sj ~ Ni(0, pcrj/s^), sj ^ Chi^(p); with v = 1 for our Gauchy 
prior. Then the algorithm has three steps: (a) an update of the Zi’s, exactly as 
above; (b) an update of /3, as above but with Sprior replaced by the diagonal matrix 
with elements i/a'j jsj, j = 1,... ,p; and (c) an (independent) update of the p latent 
variables Sj, with Sj\f3, z,V ^ Gamma ((1 -I- v)/2, (l -|- vl3'j/a"^) /2). The complex¬ 
ity of Step (b) is now 0{p^), since SpHor and Spost must be recomputed at each 
iteration (although some speed-up may be obtained by using Sherman-Morrison 
formula). 

Of course, considering yet another type of prior would require deriving another 
strategy for sampling (3. Then if one turns to logistic regression, things get rather 
complicated. In fact, deriving an efficient Gibbs sampler for logistic regression 
is a topic of current research; see Holmes and Held (20061; Fruhwirth-Schnatter] 
and Friihwirth (20091; Gramacy and Poison (2012|; Poison et al. (20131. In a 


nutshell, the two first papers use the same data augmentation as above, but with 
Ci ^ Logistic(l) written as a certain mixture of Gaussians (infinite for the first 
paper, finite but approximate for the second paper), while Poison et al. (20131 use 
instead a representation of a logistic likelihood as an infinite mixture of Gaussians, 
with a Polya-Gamma as the mixing distribution. Each representation leads to 
introducing extra latent variables, and discussing how to sample their conditional 
distributions. 

Since their implementation is so model-dependent, the main justification for 
Gibbs samplers should be their greater performance relative to more generic al¬ 
gorithms. We will investigate if this is indeed the case in our numerical section. 


4.3.2. Hastings-Metropolis. Hastings-Metropolis consists in iterating the step de¬ 
scribed as Algorithm Much like importance sampling, Hastings-Metropolis is 
both simple and generic, that is, up to the choice of the proposal kernel ft:(/3*|/3) 
(the distribution of the proposed point (3*, given the current point (3). A naive 
approach is to take k{I3*\(3) independent of /3, k(/3*|/3) = q{(3*), where q is some 
approximation of the posterior. In practice, this usually does not work better than 
importance sampling based on the same proposal, hence this strategy is hardly 
used. 
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Algorithm 2 Leap-frog step 

Input: (/3, a) 

Output: (/3 i,q;i) 

1 : 0 ^ 1/2 ^ OL— pE{l3) 

2: /3i t— /3 -|- (■OL\j2 

3: oti ^ OL \ i 2 ~ f'^/3£'(/3i) 


A more usual strategy is to set the proposal kernel to a random walk: k(^*|/ 3) = 
Np(/3, Sprop)- It is well known that the choice of Sprop is critical for good perform¬ 
ance. For instance, in the univariate case, if Sprop is too small, the chain moves 
slowly, while if too large, proposed moves are rarely accepted. 

A result from the optimal scaling literature (e.g. 


Roberts and Rosenthal 20011 


is that, for a Np(0, Jp) target, Sprop = (A^/p)ip with A = 2.38 is asymptotically 
optimal, in the sense that as p —)■ oo, this choice leads to the fastest exploration. 
Since the posterior of a binary regression model is reasonably close to a Gaussian, 
we adapt this result by taking Sprop = lp)^q in our simulations, where Sg is the 
covariance matrix of a (Laplace or EP) Gaussian approximation of the posterior. 
This strategy seems validated by the fact we obtain acceptance rates close to the 
optimal rate, as given by Roberts and Rosenthal (20011. 


The bad news behind this optimality result is that the chain requires 0(p) steps 
to move a 0(1) distance. Thus random walk exploration tends to become slow for 
large p. This is usually cited as the main motivation to develop more elaborate 
MGMG strategies, such as HMG, which we cover in the following section. 

4.3.3. HMC. Hamiltonian Monte Garlo (HMG, also known as Hybrid Monte Garlo, 


Duane et al. 19871 is a new type of MGMG algorithm, where one is able to per¬ 


form several steps in the parameter space before determining if the new position 
is accepted or not. Gonsequently, HMG is able to make much bigger jumps in 


the parameter space than standard Metropolis algorithms. See Neal (20101 for an 
excellent introduction. 

Gonsider the pair (/3, a), where j3 ~ p(f3\'D), and a ~ Np(0, M~^), thus with 
joint un-normalised density exp {—H(f3, ct)}, with 


H{(3, a) = E{(3) + 


E{f3) = - log {p(P)piV\f3)} . 


The physical interpretation of HMG is that of a particle at position (3, with velo¬ 
city a, potential energy E(/3), kinetic energy ^cPMa, for some mass matrix M, 
and therefore total energy given by El(13, a). The particle is expected to follow a 
trajectory such that E[(f3,cx) remains constant over time. 

In practice, HMG proceeds as follows: first, sample a new velocity vector, 
a ~ Np(0, M~^). Second, move the particle while keeping the Hamiltonian El con¬ 
stant; in practice, discretisation must be used, so L steps of step-size e are performed 
through leap-frop steps; see Algorithm which describes one such step. Third, the 
new position, obtained after L leap-frog steps is accepted or rejected according to 
probability 1 A exp {iJ(/3, a) — E{(f3*,a*)}; see Algorithm for a summary. The 
validity of the algorithm relies on the fact that a leap-frog step is “volume pre¬ 
serving”; that is, the deterministic transformation ((3, a) —>■ (j3i,a.i) has Jacobian 
one. This is why the acceptance probability admits this simple expression. 

The tuning parameters of HMG are M (the mass matrix), L (number of leap¬ 
frog steps), and e (the stepsize). For M, we follow [Neal (2010|’s recommendation 


and take M ^ = Sq, an approximation of the posterior variance (again obtained 
from either Laplace or EP). This is equivalent to rescaling the posterior so as to 
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Algorithm 3 HMC iteration 

Input: (3 
Output: [3' 

1: Sample momentum a. ^ Np(0, M). 

2: Perform L leap-frog steps (see Algorithm]^, starting from {f3,a)-, call 
{(3*, a.*) the final position. 

3: With probability 1 A r, 

r = exp {H{(3, a) — H{l3*, a*)} 

set (3 = (3*-, otherwise set ^ = /3. 


have a covariance matrix close to identity. In this way, we avoid the bad mixing 
typically incurred by strong correlations between components. 

The difficulty to choose L and e seems to be the main drawback of HMC. The 
performance of HMC seems very sensitive to these tuning parameters, yet clear 
guidelines on how to choose them seem currently lacking. A popular approach is to 


20081 


fix Le to some value, and to use vanishing adaptation (Andrieu and Thoms 
to adapt e so as to target acceptance rate of 0.65 (the optimal rate according to 
the formal study of HMC by [Beskos et al. 2013[ |: i.e. at iteration t, take e = et, 
with et = et-i — r]t{Rt — 0.65), % = t~^, k G (1/2,1) and Rt the acceptance rate 
up to iteration t. The rationale for fixing Le is that quantity may be interpreted as 
a ‘simulation length’, i.e. how much distance one moves at each step; if too small, 
the algorithm may exhibit random walk behaviour, while if too large, it may move 
a long distance before coming back close to its starting point. Since the spread of is 
already taken into account through we took eL = 1 in our simulations. 


4.3.4. NUTS and other variants of HMC. Girolami and Calderhead (20111 pro¬ 


posed an interesting variation of HMC, where the mass matrix M is allowed to 
depends on /3; e.g. M{(3) is set to the Fisher information of the model. This allows 
the corresponding algorithm, called RHMC (Riemanian HMC), to adapt locally 
to the geometry of the target distribution. The main drawback of RHMC is that 
each iteration involves computing derivatives of M{j3) with respect to (3, which is 
very expensive, especially if p is large. For binary regression, we found RMHC to 
be too expensive relative to plain HMC, even when taking into account the better 
exploration brought by RHMC. This might be related to the fact that the posterior 
of a binary regression model is rather Gaussian-like and thus may not require such 
a local adaptation of the sampler. 


We now focus on NUTS (No U-Turn sampler, Hoffman and Gelman 20131, 
a variant of HMC which does not require to specify a priori L, the number of 
leap-frog steps. Instead, NUTS aims at keeping on doing such steps until the 
trajectory starts to loop back to its initial position. Of course, the difficulty in 
this exercise is to preserve the time reversibility of the simulated Markov chain. 
To that effect, NUTS constructs iteratively a binary tree whose leaves correspond 
to different velocity-position pairs (ck,/ 3) obtained after a certain number of leap¬ 
frog steps. The tree starts with two leaves, one at the current velocity-position 
pair, and another leaf that corresponds to one leap-frop step, either in the forward 
or backward direction (i.e. by reversing the sign of velocity); then it iteratively 
doubles the number of leaves, by taking twice more leap frog steps, again either 
in the forward or backward direction. The tree stops growing when at least one 
leaf corresponds to a “U-turn”; then NUTS chooses randomly one leaf, among those 
leaves that would have generated the current position with the same binary tree 
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mechanism; in this way reversibility is preserved. Finally NUTS moves the new 
position that corresponds to the chosen leaf. 

We refer the readers to Hoffman and Gelman (20131 for a more precise description 
of NUTS. Given its complexity, implementing directly NUTS seems to require more 
efforts than the other algorithms covered in this paper. Fortunately, the STAN 
package (http;//mc-stan. org/) provides a GH—h implementation of NUTS which 
is both efficient and user-friendly: the only required input is a description of the 
model in a probabilistic programming language similar to BUGS. In particular, 
STAN is able to automatically derive the log-likelihood and its gradient, and no 
tuning of any sort is required from the user. Thus, we will use STAN to assess 
NUTS in our numerical comparisons. 


4.4. Sequential Monte Carlo. Sequential Monte Garlo (SMG) is a class of al¬ 
gorithms for approximating iteratively a sequence of distributions ttj, t = 0,..., T, 
using importance sampling, resampling, and MGMG steps. We focus here on the 
non-sequential use of SMG (Neal, 2001 Ghopin, 2002 Del Moral et al.[ 20061, where 
one is only interested in approximating the final distribution tt-t (in our case, set 
to the posterior p(/3|I?)), and the previous tt^’s are designed so as to allow for a 
smooth progression from some tiq, which is easy to sample from, to tt^. 

At iteration t, SMG produces a set of weighted particles (simulations) (/3„, Wn)n=i 
that approximates ttj, in the sense that 


—^ Wn(pil3n) E’"* [(Pif3)] 

Z^n=l n=l 

as N ^ -too. At time 0, one samples /J" ~ tto, and set = 1. To progress 
from TTt-i to TTt, one uses importance sampling: weights are multiplied by ratio 
'^t{l3n) When the variance of the weights gets too large (which indicates 

that too few particles contribute significantly to the current approximation), one 
resamples the particles: each particle gets reproduced On times, where On > 0 
is random, and such that E(0„) = Nwn /and ^n=i^n = N with 
probability one. In this way, particles with a low weights are likely to die, while 
particles with a large weight get reproduced many times. Finally, one may re¬ 
introduce diversity among the particles by applying one (or several) MGMG steps, 
using a MGMG kernel that leaves invariant the current distribution tt^. 

We focus in this paper on tempering SMG, where the sequence 

7rt(/3) (X {p{(3)p{V\f3)Y* 


corresponds to a linear interpolation (on the log-scale) between some distribution 
ttq = q, and = p(/3|I?), our posterior. This is a convenient choice in our 

case, as we have at our disposal some good approximation q (either from Laplace 
or EP) of our posterior. A second advantage of tempering SMG is that one can 
automatically adapt the “temperature ladder” St (Jasra et al.| 2011[|. Algorithm]^ 


describes a tempering SMG algorithm based on such an adaptation scheme: at each 
iteration, the next distribution ttj is chosen so that the efficiency factor (defined in 
Section 4.11 of the importance sampling step from ivt-i to ttj equals a pre-defined 
level r S (0,1); a default value is r = 1/2. 

Another part of Algorithm which is easily amenable to automatic calibration 
is the MGMG step. We use a random walk Metropolis step, i.e. Algorithm 
with proposal kernel K{f3*\(3) = Np(/3, Sprop), but with Sp^op calibrated to the 
empirical variance of the particles S: Sprop = AS, for some A. Finally, one may 
also automatically calibrate the number m of MGMG steps, as in Ridgway (20141, 
but in our simulations we simply took to = 3. 
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Algorithm 4 tempering SMC 


Operations involving index n must be performed for all n £ 1 : iV. 
0 : Sample /3„ ~ q{(3) and set J ^ 0. 

1 : Let, for <5 S [5,1], 


EF((5) = 


1 {Etc 


r(/3n)}^ 






us{l3) = 


{mpijmv 

i m I 


If EF(1) > T, stop and return {f3n,Wn)n=i-.N with Wn = iti(/3„); otherwise, 
use the bisection method ( [Press et al.[ 2007 Chap. 9) to solve numerically 
in 6 the equation EE ( 7 ) = r. 

2 : Resample according to normalised weights Wn = Em=i '^"*1 with 
Wn = usiPn)', see the supplement for one such resampling algorithm. 

3: Update the (3n’s through m MCMC steps that leaves invariant 7 rt(/ 3 ), using 
e.g. Algorithm with /t(/3*|/3) = Np(/3, Sprop), Sp^op = AS, where S is 
the empirical covariance matrix of the resampled particles. 

4: Set S^S. Go to Step 1. 


In the end, one obtains essentially a black-box algorithm. In practice, we shall 
often observe that, for simple datasets, our SMC algorithm automatically reduces 
to a single importance sampling step, because the efficiency factor of moving from 
the initial distribution q to the posterior is high enough. In that case, our SMC 
sampler performs exactly as standard importance sampling. 

Finally, we note that the reweighting step and the MCMC steps of Algorithm [^ 
are easy to parallelise. 


5. Numerical study 


The point of this section is to compare numerically the different methods dis¬ 
cussed in the previous sections, first on several datasets of standard size (that are 
representative of previous numerical studies), then in a second time on several big¬ 
ger datasets. 

We focus on the following quantities: the marginal likelihood of the data, p{p), 
and the p marginal posterior distributions of the regression coefficients Pj. Regard¬ 
ing the latter, we follow Faes et al. (20111 in defining the ‘marginal accuracy’ of 
approximation q for component j to be 


MA, = 1 - - 
^ 2 


r>+oo 


\q{P,)-p{P,\D)\ d/3, 


This quantity lies in [0,1], and is scale-invariant. Since the true marginals p{Pj\'D) 
are not available, we will approximate them through a Gibbs sampler run for a 
very long time. To give some scale to this criterion, assume q{Pj) = Ni(/3j;/ii,cr^), 
p{Pj\'D) = Ni(/3,; /r 2 , cr^), then MA, is 2$(—5/2) « 1 — 0.4 x 5 for 5 = \pi — pi 2 \l^ 
small enough; e.g. 0.996 for 5 « 0.01, 0.96 for 6 « 0.1. 

In our results, we will refer to the following four prior/model ‘scenarios’: Gaus¬ 
sian/probit, Gaussian/logit, Cauchy/probit, Cauchy/logit, where Gaussian and 
Cauchy refer to the two priors discussed in Section |2.1| All the algorithms have 
been implemented in C++, using the Armadillo and Boost libraries, and run on 
a standard desktop computer (except when explicitly stated). Results for NUTS 
were obtained by running STAN (http://mc-stan.org/) version 2.4.0. 


5.1. Datasets of moderate size. Table [T] lists the 7 datasets considered in this 
section (obtained from the UCI machine learning repository, except Elections, which 
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Dataset 

n-D 

P 

Pima (Indian diabetes) 

532 

8 

German (credit) 

999 

25 

Heart (Statlog) 

270 

14 

Breast (cancer) 

683 

10 

Liver (Indian Liver patient) 

579 

11 

Plasma (blood screening data) 

32 

3 

Australian (credit) 

690 

15 

Elections 

2015 

52 


Table 1. Datasets of moderate size (from UCI repository, 
Elections, from web-site of Gelman and Hill (2006)’s book) 
(short and long version), number of instances nx>, number 
ariates p (including an intercept) 


except 
: name 
of cov- 



0.4 0.6 0.8 1.0 

value 


^ A 

• ^ ^ 
4 


Figure 5.1. Comparison of approximation schemes across all 
datasets of moderate size: marginal accuracies (left), and abso¬ 
lute error for log-evidence versus the dimension p (right); x—axis 
range of the left plot determined by range of marginal accuracies 
(i.e. marginal accuracy may drop below 0.4 for e.g. Laplace-EM). 


is available on the web page of Gelman and Hill (20061’s book). These datasets are 
representative of the numerical studies found in the literature. In fact, it is a super¬ 


et al. ( 

20111 , 

dolmes and Held 

(20061 and also (up to one dataset with 5 covariates) 

Poison et al. 

(2013 

|. In each case, an intercept have been included; i.e. p is the 


number of predictors plus one. 


5.1.1. Fast Approximations. We compare the four approximation schemes described 
in Section]^ Laplace, Improved Laplace, Laplace EM, and EP. We concentrate on 
the Cauchy/logit scenario for two reasons: (i) Laplace EM requires a Student prior; 
and (ii) Cauchy/logit seems the most challenging scenario for EP, as (a) a Gauchy 
prior is more difficult to deal with than a Gaussian prior in EP ; and (b) contrary 
to the probit case, the site update requires some approximation; see Section [3.4| for 
more details. 
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(a) Pima 


(b) Heart 



(c) Breast (d) German 


Figure 5.2. Box-plots of marginal accuracies across the p dimen¬ 
sions, for the four approximation schemes, and four selected data¬ 
sets; plots for remaining datasets are in the supplement. For the 
sake of readability, scale of y—axis varies across plots. 


Left panel of Fig. |5.1| plots the marginal accuracies of the four approximation 
schemes across all components and all datasets; Fig. |5.2| does the same, but separ¬ 
ately for four selected datasets; results for the remaining datasets are available in 
the supplement. 

EP seems to be the most accurate method on these datasets: marginal accuracy 
is about 0.99 across all components for EP, while marginal accuracy of the other 
approximation schemes tend to be lower, and may even drop to quite small values; 


see e.g. the German dataset, and the left tail in the left panel of Fig. 5.1 


EP also fared well in terms of CPU time: it was at most seven times as intensive 
as standard Laplace across the considered datasets, and about 10 to 20 times faster 
than Improved Laplace and Laplace EM. As expected (see Section 3.51. Of course, 
the usual caveats apply regarding CPU time comparison, and how they may depend 
on the hardware, the implementation, and so on. 
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We also note in passing the disappointing performance of Laplace EM, which 
was supposed to replace standard Laplace when the prior is Student, but which 
actually performs not as well as standard Laplace on these datasets. 

We refer the reader to the supplement for similar results on the three other 
scenarios, which are consistent with those above. In addition, we also represent the 
approximation error of EP and Laplace for approximating the log-evidence in the 
right panel of Fig. |5.1| Again, EP is found to be more accurate than Laplace for 
most datasets (except for the Breast dataset). 

To conclude, it seems that EP may be safely be used as a complete replacement 
of sampling-based methods on such datasets, as it produces nearly instant results, 
and the approximation error along all dimensions is essentially negligible. 


5.1.2. Importance sampling, QMC. We now turn to importance sampling (IS), 
which we deemed our “gold standard” among sampling-based methods, because 
of its ease of use and other nice properties as discussed in Section |4.1| We use 
TV = 5 X 10® samples, and a Gaussian EP proposal. (Results with a Laplace pro¬ 
posal are roughly similar.) We consider first the Gaussian/probit scenario, because 
this is particularly favorable to Gibbs sampling; see next section. Table reports 
for each dataset the efficiency factor of IS (as defined in Section 4.1), the GPU time 
and two other quantities discussed below. 



IS 

IS-QMG 

Dataset 

EE 

GPU 

MT 

MSE improv. 

MSE improv. 


= ESS/iV 

time 

speed-up 

(expectation) 

(evidence) 

Pima 

99.5% 

37.54 s 

4.39 

28.9 

42.7 

German 

97.9% 

79.65 s 

4.51 

13.2 

8.2 

Breast 

82.9% 

50.91 s 

4.45 

2.6 

6.2 

Heart 

95.2% 

22.34 s 

4.53 

8.8 

9.3 

Liver 

74.2 % 

35.93 s 

4.76 

7.6 

11.3 

Plasma 

90.0% 

2.32 s 

4.28 

2.2 

4.4 

Australian 

95.6% 

53.32 s 

4.57 

12 

20.3 

Elections 

21.39% 

139.48 s 

3.87 

617.9 

3.53 


Table 2. Performance of importance sampling (IS), and 
QMG importance sampling (IS-QMG), on all datasets, in Gaus¬ 
sian/probit scenario: efficiency factor (EE), GPU time (in seconds), 
speed gain when using multi-threading Intel hyper-threaded quad 
core GPU (Speed gain MT), and efficiency gain of QMG (see text). 


We see that all these efficiency factors are all close to one, which means IS works 
almost as well as IID sampling would on such datasets. Further improvement 
may be obtained by using either parallelization, or QMG (Quasi-Monte Garlo, see 
Section 4.2). Table reports the speed-up factor obtained when implementing 
multi-threading on our desktop computer which has a multi threading quad core 
GPU (hence 8 virtual cores). We also implemented IS on an Amazon EG2 instance 
with 32 virtual GPUs, and obtained speed-up factors about 20, and running times 
below 2s. 

Finally, Tablej^also reports the MSE improvement (i.e. MSE ratio of IS relative 
to IS-QMG) obtained by using QMG, or more precisely RQMG (randomised QMG), 
based on a scrambled Sobol’ sequence (see e.g. Lemieux, 2009). Specifically, 


the table reports the median MSE improvement for the p posterior expectations 
(first column), and the MSE improvement for the evidence (second column). The 
improvement brought by RQMG varies strongly across datasets. 
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(a) Median IRIS for the p pos¬ 
terior expectations E[/3j|X)] 


(b) Median IRIS for the p pos¬ 
terior variances Var[/3j|I?] 


Figure 5.3. IRIS (Inefficiency relative to importance sampling) 
across all datasets for MCMC schemes and Gaussian/probit scen¬ 
ario; left (resp. right) panel shows median IRIS when estimating 
the p posterior expectations (resp. the p posterior variances). 


The efficiency gains brought by parallelization and QMC may be combined, 
because the bulk of the computation (as reported by a profiler) is the N likelihood 
evaluations, which are trivial to parallelize. 

It is already clear that other sampling-based methods do not really have a fighting 
chance on such datasets, but we shall compare them in the next section for the sake 
of completeness. See also the supplement for results for other scenarios, which are 
very much in line with those above. 

5.1.3. MCMC schemes. In order to compare the different sampling-based methods, 
we define the IRIS (Inefficiency Relative to Importance Sampling) criterion, for a 
given method M and a given posterior estimate, as follows: 

MSEm CPU/s 
MSEis ^ CPUm 

where MSEm (resp. MSE/g) is the mean square error of the posterior estimate 
obtained from method M (resp. from importance sampling), and CPUm the CPU 
time of method M (resp. importance sampling). The comparison is relative to 
importance sampling without parallelisation or quasi-Monte Carlo sampling. In 
terms of posterior estimates, we consider the expectation and variance of each 
posterior marginal p(/3j|I?). We observe that, in both cases, IRIS does not vary 
much across the p components, so we simply report the median of these p values. 
Fig |5.3| reports the median IRIS across all datasets. We refer the reader to Section 
|4.3| for how we tuned these MCMC algorithms. 

The first observation is that all these MCMC schemes are significantly less effi¬ 
cient than importance sampling on such datasets. The source of inefficiency seems 
mostly due to the autocorrelations of the simulated chains (for Gibbs or random 
walk Metropolis), or, equivalently, the number of leap-frog steps performed at each 
iteration in HMC and NUTS. See the supplement for ACF’s (Autocorrelation plots) 
to support this statement. 
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Dataset 

nv 

P 

Musk 

476 

95 

Sonar 

208 

61 

DNA 

400 

180 


Table 3. Datasets of larger size (from UCI repository): name, 
number of instances nx>, number of covariates p (including an in¬ 
tercept) 


Second, HMC and NUTS do not perform significantly better than random-walk 
Metropolis. As already discussed, HMC-type algorithms are expected to outperform 
random walk algorithms as p —>■ -l-oo. But the considered datasets seem too small 
to give evidence to this phenomenon, and should not be considered as reasonable 
benchmarks for HMC-type algorithms (not to mention again that these algorithms 
are significantly outperformed by IS on such datasets). We note in passing that it 
might be possible to get better performance for HMC by finely tuning the quantities 
e and L on per dataset basis. We have already explained in the introduction why 
we think this is bad practice, and we also add at this stage that the fact HMC 
requires so much more effort to obtain good performance (relative to other MCMC 
samplers) is a clear drawback. 

Regarding Gibbs sampling, it seems a bit astonishing that an algorithm special¬ 
ised to probit regression is not able to perform better than more generic approach on 
such simple datasets. Recall that the Gaussian/probit case is particularly favour¬ 
able to Gibbs, as explained in Section |4.3.1[ See the supplement for a comparison 
of MCMC schemes in other scenarios than Gaussian/probit; results are roughly 
similar, except that Gibbs is more significantly outperformed by other methods, as 
expected. 


5.2. Bigger datasets. Finally, we turn our attention to the bigger datasets sum¬ 
marised by Table These datasets not only have more covariates (than those 
of the previous section), but also stronger correlations between these covariates 
(especially Sonar and Musk). We consider the probit/Gaussian scenario. 

Regarding fast approximations, we observe again that EP performs very well, 
and better than Laplace; see Figure 5.4 It is only for DNA (180 covariates) that 


the EP approximation starts to suffer. 

Regarding sampling-based methods, importance sampling may no longer be used 
as a reference, as the effective sample size collapses to a very small value for these 
datasets. We replace it by the tempering SMC algorithm described in Section 
|4.4[ Moreover, we did not manage to calibrate HMC so as to obtain reasonable 
performance in this setting. Thus, among sampling-based algorithms, the four 
remaining contenders are: Gibbs sampling, NUTS, RWHM (random walk Hastings- 
Metropolis), and tempering SMC. Recall that the last two are calibrated with the 
approximation provided by EP. 

Figure |5.5| reports the “effective sample size” of the output of these algorithms 
when run for the same fixed CPU time (corresponding to 5 x 10® iterations of 
RWHM), for the p posterior expectations (left panels), and the p posterior variances 
(right panels); here “effective sample size” is simply the posterior variance divided 
by the MSE of the estimate (across 50 independent runs of the same algorithm). 

No algorithm seems to vastly outperform the others consistently across the three 
datasets. If anything, RWMH seems to show consistently best or second best 
performance. 
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(a) Musk 


(b) Sonar 


(c) DNA 

Figure 5.4. Marginal accuracies across the p dimensions of EP 
and Laplace, for datasets Musk, Sonar and DNA 


Still, these results offer the following insights. Again, we see that Gibbs sampling, 
despite being a specialised algorithm, does not outperform significantly more generic 
algorithms. Recall that the probit/Gaussian scenario is very favourable to Gibbs 
sampling; in other scenarios (results not shown), Gibbs is strongly dominated by 
other algorithms. 

More surprisingly, RWHM still performs well despite the high dimension. In 
addition, RHHM seems more robust than SMG to an imperfect calibration; see the 
DNA example, where the error of the EP approximation is greater. 

On the other hand, SMG is more amenable to parallelisation, hence on a parallel 
architecture, SMG would be likely to outperform the other approaches. 
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(a) Musk 



(b) Sonar 



(c) DNA 


Figure 5.5. Effective sample size for a fixed CPU time for 
sampling-based algorithms: posterior expectations (left), and pos¬ 
terior variances (right) for datasets (from top to bottom): Musk, 
Sonar, and ADN 
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6. Variable selection 

We discuss in this section the implications of our findings on variable selection. 
The standard way to formalise variable selection is to introduce as a parameter the 
binary vector 7 G {0,1}^, and to define the likelihood 


p ( 25 |/ 3 , 7 ) = ]\F{y,f3'^x-^ 


i=l 


where (3-^ (resp. x^^i) is the vector of length I7I that one obtains by excluding 
from (3 (resp. Xi) the components j such that 7_,- = 0. Several priors may be 
considered for this problem ( [Chipman et al. 2001[ |, but for simplicity, we will take 
p(/3,7) = p{(3)p{'y) where p(/3) is either the Cauchy prior or the Gaussian prior 


discussed in Section 2.1 and p{'y) is the uniform distribution with respect to the 
set {0,1}^, p(7) = 2-P. 

Computationally, variable selection is more challenging than parameter estima¬ 
tion, because the posterior p(/3,7|I?) is a mixture of discrete and continuous com¬ 
ponents. If p is small, one may simply perform a complete enumeration: for all the 
2P possible values of 7, approximate p{'D\'y) using e.g. importance sampling. If p 
is large, one may adapt the approach of [Schafer and Chopin (20111, as described 
in the next sections. 


6.1. SMC algorithm of jSchafer and Chopin (2011). In linear regression, yi = 
0^x-f i + Ei, £i ~ Ni(0, cr^), the marginal likelihood p{'D\^) is available in close 
form (for a certain class of priors). Schafer and Chopin| ( 201l| use this property to 
construct a tempering SMC sampler, which transitions from the prior ^(7) to the 
posterior p{^\D), through the tempering sequence 71^(7) ex p(7)p(I?|7)'^*, with 5t 
growing from 0 to 1. This algorithm has the same structure as Algorithm]^ (with the 
obvious replacements of the /3’s by 7’s and so on.) The only difference is the MCMC 
step used to diversify the particles after resampling. Instead of a random walk step 


(which would be ill-defined on a discrete space), Schafer and Chopin (20111 use a 
Metropolis step based on an independent proposal, constructed from a sequence of 
nested logistic regressions: proposal for first component 71 is Bernoulli, proposal for 
second component 72, conditional on 71, corresponds to a logistic regression with 
7i and an intercept as covariates, and so on. The parameters of these p successive 
regressions are simply estimated from the current particle system. [Schafer andj 


Chopin (20111 show that their algorithm significantly outperform several MCMC 


samplers on datasets with more than 100 covariates. 


6.2. Adaptation to binary regression. For binary regression models, p{T>\'^) 
is intractable, so the approach of Schafer and Chopin ( 2011[ | cannot be applied 
directly. On the other hand, we have seen that (a) both Laplace and EP may 
provide a fast approximation of the evidence p(I?|7); and (b) both importance 
sampling and the tempering SMC algorithm may provide an unbiased estimator of 

pi^h). _ 


Based on these remarks, Schafer (20121 in his PhD thesis considered the following 


extension of the SMC algorithm of Schafer and Chopin (20111: in the sequence 
714(7) oc p(7)p(I?|7)‘^*, the intractable quantity p(I?|7) is simply replaced by an 
unbiased estimator (obtained with importance sampling and the Gaussian proposal 
corresponding to Laplace). The corresponding algorithm remains valid, thanks to 
pseudo-marginal arguments (see e.g. Andrieu and Roberts 20091. Specifically, one 
may re-interpret the resulting algorithm as a SMC algorithm for a sequence of 
distribution of an extended space, such that marginal in 7 is exactly the posterior 
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0.00 0.2S 


(a) Gibbs 


(b) SMC 


Figure 6.1. Variation of estimated inclusion probabilities p(7j = 
1 |I?) over 50 runs for the p covariates of Musk dataset: median (red 
line), 80% confidence interval (white box); the black-box extends 
until the maximum value. 


p{'D\'y) at time t = T. In fact, it may be seen as a particular variant of the SMC^ 
algorithm of Chopin et al. (20131. 


6.3. Numerical illustration. We now compare the proposed SMC approach with 


the Gibbs sampler of Holmes and Held (2006) for sampling from p(f3,'j\'D), on the 
Musk dataset. Both algorithms were given the same CPU budget (15 minutes), 
and were run 50 times; see Figure |6.1[ Clearly, the SMC sampler provides more 
reliable estimates of the inclusion probabilities p{'yj = 1|I?) on such a big dataset. 
See also the PhD dissertation of Schafer ( 2012[ | for results consistent with those, 
on other datasets, and when comparing to the adaptive reversible jump sampler of 


Lamnisos et al. (2013). 


6.4. Spike and slab. We also note in passing that a different approach to the 


variable selection problem is to assign a spike and slab prior to (3 (George and 


McCulloch 19931: 


p 

p{f^) = n {^Ni(/3j;0,u^) -k (1 - A)Ni(/3j;0,u^)} , < uj 

i=i 

where A S (0,1), Vq and vf are fixed hyper-parameters. This prior generates a 
continuous posterior (without point masses at /3j = 0), which is easier to sample 
from than the discrete-continuous mixture obtained in the standard formulation 
of Bayesian variable selection. It would be interesting to see to which extent our 

to this particular type of posteriors; see for instance 
) for how to deal with such priors in EP. 

7. Conclusion and extensions 


discussion and hndings extend 


Hernandez-Lobato et al. (2013 


7.1. Our main messages to users. Our first and perhaps most important mes¬ 
sage to end users is that Bayesian computation (for binary regression) is now suf¬ 
ficiently fast for routine use: if the right approach is used, results may be obtained 
near instantly on a standard computer, at least on simple datasets. 
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Concretely, as far as binary regression is concerned, our main recommendation 
is to always use EP. It is very fast, and its approximation error is negligible in most 
cases (for such models). EP requires some expertise to implement, but the second 
author will release shortly a R package that computes the EP approximation for any 
logit or probit model. The only drawback of EP is the current lack of theoretical 
support. We learnt however while finishing this manuscript that Simon Barthelme 
and Guillaume Dehaene (personal communication) established that the error rate 
of EP is 0{n^) in certain models (where nx) is the sample size). This seems to 
explain why EP often performs so well. 

In case one wishes to assess the EP error, by running in a second step some exact 
algorithm, we would recommend to use the SMC approach outlined in Section [4.4| 
(i.e. with initial particles simulated from the EP approximation). Often, this 
SMC sampler will reduce to a single importance sampling step, and will perform 
extremely well. Even when it does not, it should provide decent performance, 
especially if run on (and implemented for) a parallel architecture. Alternatively, on 
a single-core machine, random walk Metropolis is particularly simple to implement, 
and performs surprisingly well on high-dimensional data (when properly calibrated 
using EP). 


7.2. Our main message to Bayesian computation experts. Our main mes¬ 
sage to Bayesian computation scientists was already in the title of this paper: leave 
Pima Indians alone, and more generally, let’s all refrain from now on from using 
datasets and models that are too simple to serve as a reasonable benchmark. 

To elaborate, let’s distinguish between specialised algorithms and generic al¬ 
gorithms. 

For algorithms specialised to a given model and a given prior (i.e. Gibbs samplers), 
the choice of a “benchmark” reduces to the choice of a dataset. It seems unfortunate 
that such algorithms are often showcased on small datasets (20 covariates or less), 
for which simpler, more generic methods perform much better. As a matter of fact, 
we saw in our simulations that even for bigger datasets Gibbs sampling does not 
seem to offer better performance than generic methods. 

For generic algorithms (Metropolis, HMC, and so on), the choice of a benchmark 
amounts to the choice of a target distribution. A common practice in papers pro¬ 
posing some novel algorithm for Bayesian computation is to compare that algorithm 
with a Gibbs sampler on a binary regression posterior for a small dataset. Again, 
we see from our numerical study that this benchmark is of of limited interest, and 
may not be more informative than a Gaussian target of the same dimension. If one 
wishes to stick with binary regression, then datasets with more than 100 covari¬ 
ates should be used, and numerical comparisons should include at least a properly 
calibrated random walk Metropolis sampler. 


7.3. Big data and the frontier. Several recent papers (Wang and Dunson 


2013 Scott et al. 2013 Bardenet et al. 20151 have approached the ’big data 


problem in Bayesian computation by focussing on the big n-p (many observations) 
scenario. In binary regression, and possibly in similar models, the big p problem 
(many covariates) seems more critical, as the complexity of most the algorithms 
we have discussed is O(npp^). Indeed, we do not believe that any of the methods 
discussed in this paper is practical for p ^ 1000. The large p problem may be 
therefore the current frontier of Bayesian computation for binary regression. 

Perhaps one way to address the large p problem is to make stronger approxima¬ 
tions; for instance by using EP with an approximation family of sparse Gaussians. 
Alternatively, one may use a variable selection prior that forbids that the number 
of active covariates is larger than a certain threshold. 
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7.4. Generalising to other models. We suspect some of our findings may apply 
more generally to other models (such as certain generalised linear models), but, of 
course, further study is required to assess this statement. 

On the other hand, there are two aspects of our study which we recommend 
to consider more generally when studying other models: parallelisation, and tak¬ 
ing into account the availability of fast approximations. The former has already 
been discussed. Regarding the latter, binary regression models are certainly not 
the only models such that some fast approximations may be obtained, whether 
through Laplace, INLA, Variational Bayes, or EP. And using this approximation to 
calibrate sampling-based algorithms (Hastings-Metropolis, HMC, SMC, and so on) 
will often have a dramatic impact on the relative performance of these algorithms. 
Alternatively, one may also discover in certain cases that these approximations are 
sufficiently accurate to be used directly. 
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